ABSTRACT

Conjugation of DNA through a type IV secretion system
(T4SS) drives horizontal gene transfer. Yet little is
known about the diversity of these nanomachines. We
previously found that T4SS can be divided into eight
classes based on the phylogeny of the only ubiquitous
protein of T4SS (VirB4). Here, we use an ab initio
approach to identify protein families systematically
and specifically associated with VirB4 in each
class. We built profiles for these proteins and used 
them to scan 2262 genomes for the presence of T4SS. 
Our analysis led to the identification of thousands 
of occurrences of 116 protein families for a total 
of 1623 T4SS. Importantly, our profiles almost always
identified the essential genes of well-studied
T4SS. This allowed us to build a database
with the largest number of T4SS described to date. 
Using profile-profile alignments, we reveal many new 
cases of homology between components of distant 
classes of T4SS. We mapped these similarities on 
the T4SS phylogenetic tree and thus obtained the 
patterns of acquisition and loss of these protein fam- 
ilies in the history of T4SS. The identification of the 
key VirB4-associated proteins paves the way toward 
experimental analysis of poorly characterized T4SS 
classes. 

INTRODUCTION 

Prokaryotes have the ability to adapt quickly by acquiring 
genes from other prokaryotes (1-3). Conjugation, which 
is one of the major mechanisms of gene transfer, requires 
cell-to-cell contact and is able to deliver the whole genome
of one cell into another. Conjugation-specific proteins are
found in all major taxa of prokaryotes, even though exper- 
imental evidence is still mostly restricted to Proteobacte- 
ria and Firmicutes (4-8). The most frequent mechanism of 
DNA conjugation involves the passage of single-stranded 
DNA (ssDNA) from the donor cell to the recipient, upon 
which replication re-establishes double-stranded (dsDNA) 
copies in each cell (7). This mechanism relies on three major 
components: a relaxosome, a coupling protein (T4CP) and 
a type IV secretion system (T4SS). The relaxosome includes 
a protein essential for conjugation — the relaxase (MOB) — 
that nicks the dsDNA and binds the resulting ssDNA at the 
origin of transfer (see (7,9) for reviews). The relaxase bound
to the ssDNA molecule is coupled to a T4SS by a T4CP and 
translocated through the donor membrane(s) to the cyto- 
plasm of the recipient. Two different coupling proteins have 
been identified: VirD4 and TcpA. TcpA is found within cer- 
tain systems of Firmicutes, and is more closely related to 
FtsK, a protein involved in chromosome segregation, than 
to VirD4 (10,11). VirD4 is associated with the vast majority
of T4SS and probably originated from an ssDNA translo- 
case (12). Some mobile genetic elements encode a relaxase 
and occasionally a T4CP but no T4SS. These elements are 
very abundant in bacterial genomes and are called 'mobi- 
lizable' because they use a T4SS encoded in trans. Most 
T4SS are thought to be involved in conjugation (nucleo- 
protein secretion), but some are specialized in protein secre- 
tion, allowing the delivery of effector proteins to the cytosol 
of eukaryotic organisms. These T4SS typically lack a relax- 
ase (MOBless T4SS), but require a T4CP (see (5) for excep- 
tions). Several systems are able to deliver both the DNA- 
bound relaxase and other protein effectors (13-15). There 
are also examples of T4SS involved in DNA import (the 
ComB system of Helicobacter pylori (16)) or in DNA secre- 
tion (the GGI system of Neisseria gonorrhoeae (17)) associ- 
ated with natural transformation. T4SS are thus remarkably
flexible nanomachines adapted to translocate large
macromolecules through multiple cell membranes.

To whom correspondence should be addressed. Tel: +33 1 57 27 77 45 Email: jgugliel@pasteur.fr
Present address:
Julien Guglielmini, Infection, Antimicrobiens, Modelisation, Evolution, INSERM UMR1137, Paris 75018, France.
© The Author(s) 2014. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

5776 Nucleic Acids Research, 2014, Vol. 42, No. 9

The plasticity of T4SS results in a diversity of systems, 
most of which are still poorly characterized. One complication
in the study of T4SS is the lack of a standard nomenclature:
genes with similar names are not necessarily homologs
and homologs do not necessarily have the same names.
Here, we follow the convention that mating-pair formation 
(MPF) genes are labeled with the name of the associated 
mobile element, and protein profiles are labeled with the 
name of the MPF class. For example, TraB_F is the protein
TraB encoded by plasmid F and TraB_MPFF is the protein
profile for the protein family including TraB_F of MPF_F. The
only exception concerns the VirB proteins, which by default 
concern the T-DNA transfer virB system of Agrobacterium 
tumefaciens, and whose names are in general used without 
ambiguity (i.e. similar names correspond to homologs). We 
use the VirB system, composed of 11 genes from VirB1 to
VirB11, as a model because it is by far the best characterized
T4SS (9,18) (Figure 1). The core secretion channel complex
of the VirB T4SS (including VirB7, VirB9 and VirB10) that
spans the periplasm and both cell membranes is thought to
be the first to assemble. VirB10 lines the inner surface of the
core complex chambers, and spans the whole length of the 
core complex, forming the bottom ring of the inner mem- 
brane layer. The protein VirB9 forms the outer sheath of 
the core complex, which interacts with and is stabilized by VirB7,
a small lipoprotein (9,19). The three proteins VirB3, VirB6 
and VirB8 are thought to join the core complex to pro- 
duce the inner-membrane pore (20). VirB6 is a polytopic 
membrane protein with a number of transmembrane do- 
mains (TMDs) and a large central periplasmic loop (21). 
The role of VirB3 is not yet clear, although the protein is
essential for pilus assembly and substrate translocation. In some
systems, VirB3 is fused to VirB4, which suggests a strong
functional link between the two (22). Three AAA+ ATPases
(VirB4, VirB11 and the T4CP) join the assembled complex,
one of which (VirB4) is implicated in energizing pilus
biogenesis (23). The pilus is composed of a major and a minor
pilin (respectively VirB2 and VirB5). Finally, VirB1 is a
non-essential transglycosylase that degrades peptidoglycan and
thus facilitates T4SS assembly across the cell wall (24,25). 

In a recent study, we found that the phylogeny of VirB4, 
the only ubiquitous protein with recognizable homologs in 
all known T4SS, is divided into eight large robust clades that
correspond to eight MPF classes (11). Based on these results
we proposed an evolution-aware classification of T4SS
that shows strong associations with prokaryotic systematics
and, to some extent, with the structure of the cell envelope
(11). The association between T4SS composition and
cell envelope structure has been recently reviewed (20). It 
should be stressed that within a given MPF class there is 
also co-evolution between the composition of the mem- 
brane and the T4SS (26). Four MPF classes encompass the 
conjugation systems of Proteobacteria and closely related 
taxa: VirB-like systems are the most numerous (MPF_T);
F-like systems are particularly abundant in plasmids of
γ-Proteobacteria (MPF_F); R64-like systems are much rarer
(MPF_I); and ICEHin1056-like systems are almost exclusively
found as integrative elements (ICE) in Proteobacteria
(MPF_G) (6,8,27). Two MPF classes are much less well
known and include systems present only in Cyanobacteria
(MPF_C) or in Bacteroidetes (MPF_B). Finally, we delimited
two different MPF classes in monoderms (organisms devoid
of an outer membrane). One class is found in Firmicutes
and Actinobacteria (MPF_FA) whereas the other is also
found in Tenericutes and Archaea (MPF_FATA).
T4SS are very diverse: 10 of 11 virB genes are essential 
(24), whereas the conjugative system of R64 is encoded by 
49 genes of which 23 are essential for solid mating plus 
12 for liquid mating (28). The exact number of genes and
their essentiality for T4SS assembly or function are unknown
in many classes of T4SS. Some systems of Firmicutes are
thought to lack pili altogether (4); for example, the T4SS of the
pCF10 plasmid of Enterococcus faecalis, the pGO1 plasmid
or the ICE TnGBS of Streptococcus encode adhesins that
stabilize the mating process. In taxa such as Archaea,
Actinobacteria or Cyanobacteria, there are very few reports on
the mechanisms of ssDNA conjugation, and none, to the 
best of our knowledge, on the T4SS structure and composi- 
tion. Yet these taxa encode homologs of VirB4 and VirD4 
in their mobile genetic elements suggesting the presence of 
T4SS-mediated conjugation (8). 

Here, we took advantage of our previous dataset of VirB4 
homologs to characterize all MPF classes. The repertoires, 
diversity and evolution of relaxases have been recently re- 
ported (29). Therefore, we focused our attention on T4SS. 
To establish the repertoire of protein families typical of each 
class of T4SS, we detected the families of genes systemat- 
ically associated and co-localized with VirB4 in the eight 
MPF classes. With these families we built protein profiles 
that we used to detect and classify T4SS. This resulted in 
a much more complete database of T4SS than those cur- 
rently available. Using profile-profile alignments we identi- 
fied distant homologies between protein families of differ- 
ent MPF classes thereby providing the first large-scale sys- 
tematic analysis of protein homology between all different 
MPF classes. Finally, we built a web resource, called CONJdb,
that presents all our data in a searchable and comprehensive
manner.

MATERIALS AND METHODS 

Data 

Data on complete prokaryotic chromosomes and plasmids
were taken from GenBank RefSeq
(ftp://ftp.ncbi.nih.gov/genomes/Bacteria/, last accessed February 2013). The
dataset of 2262 complete prokaryotic genomes comprised
2393 chromosomes and 1813 plasmids. We used the annotations
of the GenBank files, having removed all pseudogenes
and proteins with inner stop codons. Some protein profiles
were taken and improved from our previous work (8).

Construction of protein profiles 

Figure 1 describes the procedure used to construct the pro- 
tein profiles. We searched for genes encoding TraU/VirB4 
in the genomes using HMMER (see below). We then gath- 
ered the 20 genes on each side of each traU/virB4 gene. This 
resulted in seven sets of proteins, one for each T4SS class 
(except for the previously analyzed MPF_T). We made
all-against-all BLASTP searches in each set (default settings)
and used the output to build protein clusters using SiLiX
(30) (identity >30% and overlap >50%). We carried out
a multiple alignment of the proteins in each cluster with
more than five proteins using MUSCLE (31) (default
parameters). We used the multiple alignments to build
phylogenetic trees using PHYML (32). With these two pieces
of evidence we removed the very few cases of extreme
divergence, the proteins that were too short and the proteins
that were too long (typically false positives, fusions or
fissions of proteins caused by sequencing errors or
pseudogenization). Then, we re-built multiple alignments of the
selected proteins with MUSCLE, manually checked the
alignments and trimmed them to remove poorly aligned regions
at the edges, if relevant. Finally, we used HMMER 3.0 (33)
to build protein profiles from the manually curated multiple
alignments.

Figure 1. Procedure used to create mating-pair formation (MPF) protein profiles. We used all the genes found within a window of -20/+20 genes around the VirB4 proteins of each clade (named MPF_I, MPF_C, MPF_G, MPF_T, MPF_F, MPF_B, MPF_FATA and MPF_FA; see text for details) of its phylogeny (11). Numbers on the phylogeny correspond to bootstrap values, from (11). With these seven datasets of proteins (one for each class, except for MPF_T for which we already had the protein profiles) we performed all-versus-all BLASTP and used the scores to build protein families. We made protein alignments of these families, and kept the ones that give hits within the class. Thumbnail: scheme of the virB system.
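
The clustering step can be illustrated with a minimal sketch. It assumes that the BLASTP output has already been parsed into tuples of two protein identifiers, a percentage of identity and a percentage of overlap; this tuple layout and the function name cluster_proteins are hypothetical. Proteins are grouped by single linkage, which is only a simplified stand-in for SiLiX and not its actual implementation.

```python
from collections import defaultdict


def cluster_proteins(pairs, min_identity=30.0, min_overlap=50.0, min_size=6):
    """Single-linkage clustering of pairwise BLASTP hits (illustrative only).

    pairs: iterable of (protein_a, protein_b, percent_identity, percent_overlap).
    Two proteins are linked when identity > min_identity and overlap > min_overlap.
    Only clusters with at least min_size members ("more than five proteins")
    are returned, as sets of protein identifiers.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, overlap in pairs:
        root_a, root_b = find(a), find(b)  # registers both proteins
        if identity > min_identity and overlap > min_overlap:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for protein in list(parent):
        clusters[find(protein)].add(protein)
    return [members for members in clusters.values() if len(members) >= min_size]
```

Each returned cluster would then be aligned and curated as described above; the thresholds are those quoted in the text, while the minimum size of six is our reading of "more than five proteins".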

It should be pointed out that homologous proteins with 
very little sequence similarity might escape the BLASTP de- 
tection used in the clustering procedure if a protein does not 
find a single hit in a family. However, the use of a method 
allowing detection of these homologs would probably not 
help at this stage, since the protein profiles must be built 
from proteins providing reliable multiple alignments. For a 
few very divergent proteins this may lead to their exclusion 
from the analysis. For families aligning poorly with other 
sub-families, our method will provide several independent
profiles. In this case, the profile-profile alignment procedure
will identify the homology between families.



Identification and analysis of MPF protein profiles 

To identify the profiles corresponding to MPF proteins
within all the protein profiles, we performed hidden Markov
model (HMM) searches on the genome data. We used HMMER
to identify components of the T4SS. We kept the hits
showing an i-evalue < 0.01 and a coverage of the protein
profile higher than 50%. Some T4SS genes may be more
than 20 genes apart from the VirB4 homolog (even if we
could not find a single such occurrence in the model
systems). The proteins encoded by these genes are not used
to build the protein profiles. Yet these proteins are
identified by these profiles when we scan the genomes. We
keep them for further analysis when more than 50% of the
hits of the protein family are located in a neighborhood of
-20/+20 genes around a traU/virB4 gene, i.e. as long as the
elements of the protein family are not systematically distant
from VirB4. All genes regarded as essential components of
the T4SS were in this situation. Then, we applied the
following procedure to the remaining HMM profiles. We
compared all pairs of profiles with each other and with those
of the Pfam 26 database (13 672 protein families) using
HHsearch (34) (P < 0.001 threshold) (35). As is common usage
in comparative genomics, the pairs of significant hits were
regarded as homologs. We cannot totally exclude the
possibility that the very low similarity between some families
could be due to convergence. We removed from further
analysis the profiles whose Pfam annotation was clearly related
to functions other than conjugation. We inspected the
co-localization patterns of the hits of the remaining profiles. We
removed protein profiles that gave hits systematically
distant (more than five ORFs) from the others. The resulting
set of profiles was used to scan the genomes and build
CONJdb.
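
The two numerical filters described in this section can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: hits are assumed to be already parsed into an independent e-value and profile coordinates, gene positions are assumed to be (replicon, gene rank) pairs, replicon circularity is ignored, and all function names are hypothetical.

```python
def keep_hit(i_evalue, profile_start, profile_end, profile_length,
             max_i_evalue=0.01, min_coverage=0.5):
    """True when the hit has i-evalue < 0.01 and covers > 50% of the profile."""
    coverage = (profile_end - profile_start + 1) / profile_length
    return i_evalue < max_i_evalue and coverage > min_coverage


def fraction_near_virb4(hit_positions, virb4_positions, window=20):
    """Fraction of hits lying within +/- window genes of any virB4 gene.

    Both arguments are lists of (replicon, gene_rank) tuples; a hit counts
    only if a virB4 gene sits on the same replicon within the window.
    """
    if not hit_positions:
        return 0.0
    near = sum(
        1
        for replicon, rank in hit_positions
        if any(r == replicon and abs(rank - v) <= window
               for r, v in virb4_positions)
    )
    return near / len(hit_positions)


def family_is_retained(hit_positions, virb4_positions,
                       window=20, min_fraction=0.5):
    """Keep a family when more than 50% of its hits neighbor a virB4 gene."""
    return fraction_near_virb4(hit_positions, virb4_positions, window) > min_fraction
```

Both thresholds are strict inequalities, following the wording "higher than 50%" and "more than 50%" in the text.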



RESULTS 

Characterization of T4SS protein families 

We built a procedure to identify T4SS based on two well- 
established features (5). (i) Genes encoding the T4SS are 
generally grouped together in one or a few operons. (ii) 
VirB4 is the only protein family identified in all functional 
T4SS. We identified the occurrences of virB4 in 2262
complete prokaryotic genomes using a previously defined
protein profile (see the Materials and Methods section). These
proteins were found in all major taxa. We then fetched the 
genes in the genomic neighborhood of each of the 1623 
virB4 genes. Pairwise similarity searches followed by clus- 
tering and curation resulted in 652 families. The multiple 
alignments of each family were used to build protein pro- 
files (HMMs), which were applied to scan the genomic data. 
Some T4SS components may be more than 20 genes apart 
from virB4. This is the case of T4SS encoded in several loci 
scattered in the genomes of Rickettsiales (36). These pro- 
teins were not used to build the protein profiles. Neverthe- 
less, they were subsequently identified in the step of genome 
scanning with the protein profiles. They are therefore in- 
cluded in the analysis of T4SS. Our method has no phy- 
logenetic bias, i.e. we do not a priori restrict protein fam- 
ilies associated with a VirB4 class to a given taxonomic 
group. However, most studied T4SS are from Proteobacte- 
ria and Firmicutes and this may lead to two inevitable im- 
plicit biases in the analysis. Firstly, analogous, not homol- 
ogous, proteins may fill the same function in different taxa 
and might be missed if there are few representatives of the 
taxa or if these genes are not encoded systematically close to 
virB4. For example, we have reported that we probably miss 
relaxases from Archaea and Actinobacteria (11). Secondly, 
proteins evolving too rapidly produce smaller protein fami- 
lies that might miss representatives from the T4SS more evo- 
lutionarily distant from the model systems. Candidates for 
such functions can be fetched using other protein features 
like peptide signals or TMDs (see below). In spite of this, 
we found T4SS in nearly all taxa for which there is a
significant number of genomes. Profile-profile comparisons also
showed homology among many of these profiles across
MPF classes and across taxa (see below). Therefore, we
believe that we have uncovered the majority of protein families
systematically associated with T4SS.

Figure 2. Specificity of the profiles obtained for the different mating-pair formation (MPF) classes named I, C, G, T, F, B, FATA and FA (see text for details). Black corresponds to the percentage of hits found within -20/+20 genes around a virB4 of the corresponding class. Gray corresponds to the percentage of proteins found within a -20/+20 genes neighborhood of a virB4 of another class. White corresponds to the percentage of proteins that were not found to be associated with a VirB4.

Some of the new profiles are not specific for T4SS because
they match genes lacking a neighboring virB4 more than
50% of the time. Most of these non-specific profiles match
proteins typically encoded by mobile genetic elements, like
primases or zinc-finger proteins, which are not directly
associated with conjugation or T4SS (Supplementary Table S1).
These profiles may help the characterization of the genetic 
context of the T4SS but they are of little use to the com- 
putational study of these systems. In our genome scans we 
ignored them, except when experimental data showed their 
implication in conjugation. This was the case of relaxases 
and T4CP. Some profiles match proteins encoded by genes 
systematically neighboring virB4. These profiles are specific 
to T4SS, i.e. they contribute to their accurate identification. 
In the vast majority of cases, they also match proteins from 
one single T4SS class (Figure 2) and can therefore be used 
to classify T4SS. Out of the 1623 VirB4 hits, 93% co-localized
with other T4SS-associated profiles in Proteobacteria and 
80% in other taxa. Hence, in the vast majority of cases, 
VirB4 is indeed associated with T4SS. Exceptions may be 
due to ongoing genetic degradation of conjugation systems, 
to the existence of unknown classes of T4SS and/or to co- 
option of VirB4 for other functions. A T4SS is assigned to a 
given class if the direct neighborhood of VirB4 harbors the 
hits of at least three protein profiles from this class. Some
systems cannot be classified this way because they contain fewer
than three hits, or no hits at all, of MPF protein profiles. We
call them MPF_O ('O' for 'others'). They are mostly (75%)
T4SS loci lacking a neighboring relaxase, thus probably
devoted to protein secretion or undergoing genetic degradation.
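
The assignment rule stated above can be written as a short sketch. The input layout (pairs of profile name and MPF class found in the direct neighborhood of one VirB4), the counting of distinct profiles, and the tie-breaking rule are assumptions made for illustration; the text does not specify how a locus with three or more profiles from two classes would be handled.

```python
from collections import defaultdict


def assign_mpf_class(neighborhood_hits, min_profiles=3):
    """Assign a VirB4 neighborhood to an MPF class, or 'O' for 'others'.

    neighborhood_hits: iterable of (profile_name, mpf_class) pairs.
    A class qualifies when at least min_profiles distinct profiles of that
    class are present. Assumption: if several classes qualify, the one with
    the most distinct profiles wins (alphabetical order breaks ties).
    """
    profiles_by_class = defaultdict(set)
    for profile, mpf_class in neighborhood_hits:
        profiles_by_class[mpf_class].add(profile)

    counts = {c: len(p) for c, p in profiles_by_class.items()
              if len(p) >= min_profiles}
    if not counts:
        return "O"
    # max() returns the first maximal element, so sorting fixes the tie-break.
    return max(sorted(counts), key=counts.get)
```

A locus with hits to three MPF_F profiles and one MPF_T profile would be returned as "F", whereas a locus with only two distinct profiles of any class would be returned as "O".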

The observation that profiles typically match T4SS of one 
single class means that homologous proteins are more simi- 
lar within than between T4SS classes. This can be explained 
either by lack of homology between components of differ- 
ent T4SS classes or by tight co-evolution between these pro- 
teins and VirB4. The latter hypothesis applies to at least cer- 
tain components that have known homologs between T4SS 
classes (e.g. VirB6 or VirB1) (20,37). Hence, homologous
components exist but are not usually exchanged between 
classes. We used profile-profile comparisons to systemati- 
cally detect significant sequence similarity between protein 
families of different T4SS classes (34). These protein families
are similar in sequence and thus they are likely to
correspond to bona fide homologs. Nevertheless, since
divergence between the proteins precludes the use of phylogenetic
methods to test homology, we cannot exclude the possibility
of convergent evolution. Our analysis pinpointed 57
relations of similarity between protein families associated 
with different classes of T4SS (Figure 3), in addition to the 
relations between VirB4, T4CP and MOB homologs. In the 
following sections, we describe the protein families identi- 
fied in the different T4SS (Figure 4). We then compare these 
among themselves and with our previously defined set of 
profiles for the VirB system (11). 
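
As an illustration of how relations of similarity between classes can be tallied from profile-profile comparisons, the sketch below assumes that each comparison is available as a tuple of two profile names and a P-value, and that a dictionary maps each profile to its MPF class. The data layout and the function name are hypothetical; only the P < 0.001 threshold comes from the text.

```python
def cross_class_relations(comparisons, profile_class, max_pvalue=0.001):
    """Return the set of significant profile pairs joining different classes.

    comparisons: iterable of (profile_a, profile_b, p_value).
    profile_class: dict mapping a profile name to its MPF class.
    Pairs are stored as frozensets so that (a, b) and (b, a) count once.
    """
    relations = set()
    for a, b, p_value in comparisons:
        if p_value < max_pvalue and profile_class[a] != profile_class[b]:
            relations.add(frozenset((a, b)))
    return relations
```

The size of the returned set corresponds to the number of direct relations between classes; indirect relations, such as those drawn through a third profile, are not covered by this sketch.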

MPF_F

We used the F plasmid T4SS as the model of MPF_F. This
system is composed of 18 proteins, some homologous to
VirB components (reviewed in (38)). We built profiles for the
three components of the core complex of the T4SS (TraB_F,
TraV_F and TraK_F) (39) that display structural similarities
to the VirB10, VirB9 and VirB7 complex (40). While these
proteins do seem to have analogous roles, only the
profile-profile alignments of VirB9 with TraK_MPFF and TraV_MPFF
were significant (P-values of 7.6 × 10^-6 and 8.3 × 10^-5,
respectively). Profile-profile comparisons of VirB10 and
TraB_MPFF are not significant, although both exhibit the
same PFAM domain (PF03743, with e-values of 4.8 × 10^-56
and 1.3 × 10^-39, respectively). This is because our long
profiles (respectively 375 and 431 positions) only align at the
region common to the much smaller profile PF03743 (187
positions). The inner-membrane pore is thought to be
composed of TraG_F in interaction with TraL_F and TraE_F
(homologous to VirB3 and VirB8, respectively). The N-terminal
region of TraG_F is homologous to VirB6 while the C-terminal
part is involved in mating-pair stabilization (41). TraN_F
interacts with OmpA and LPS moieties during conjugation,
resulting in mating-pair stabilization (41,42). Its profile
did not match any other profile. TraA_F is the only pilin of
MPF_F (43,44), but its rapid evolution precluded the
definition of a protein profile for the family. We built four different
profiles for the periplasmic proteins TraW_F, TraU_F, TraF_F
and TrbC_F (45), all homologous according to our
profile-profile alignments (Figure 3). The inactivation of these
proteins leads to shortened pili (46,47). We also built a profile
for TraH_F that is thought to participate in pilus extension
(44,45). Overall, we have obtained 12 protein profiles for
MPF_F matching all known essential genes, except TraQ_F
(pilin chaperone) and the fast-evolving TraA_F pilin (48,49)
(and TraX_F, but this is dispensable (50)). Interestingly, the
products of these three latter genes interact physically (51,52).

MPF_I

We used the IncI plasmid R64 conjugative system as the
model for MPF_I. We built 16 protein profiles for MPF_I,
few of which have homologs in other classes. The profile
TraO_MPFI is homologous to VirB10 and thus probably
part of the core complex (53). TraM_MPFI is homologous
to TraE_MPFF/VirB8, and thus is probably part of the inner
scaffold. TraJ_R64 is homologous to VirB11 (BlastP
e-value <10^-14), as reported (54). TraQ_MPFI and TraR_MPFI
are homologs and profile-profile comparisons show an
indirect relation of homology with the pilin VirB2 (Figure 3).
We also built profiles for proteins required for conjugation
both in liquid and on surfaces that are encoded outside of the
main T4SS operon: TrbA_MPFI and TrbB_MPFI (28) and for
TraE_MPFI whose function is still unknown (55). TrbB_MPFI
is homologous to TraF_MPFF as previously suggested (44). We
could not obtain profiles for TraJ_R64, TraH_R64, TraS_R64 and
TraX_R64 that are not essential for conjugation (28,55). We
also could not build a specific protein profile for the SogL_R64
and SogS_R64 proteins; they have been reported to be
essential for conjugation, but not directly involved in DNA
transfer (55). Except for these, the set of profiles for MPF_I
includes all proteins for which the corresponding gene
disruption completely abolishes transfer (55).

MPF_G

MPF_G were originally described from ICEHin1056 from
Haemophilus influenzae (27,56). Here, we used the very
closely related ICEHin10810 element as a model because
it is included in a complete genome sequence. The T4SS is
encoded in the 24-gene operon tfc. We could find no
studies on the structure or assembly of this T4SS, but
comparative analyses identified 13 genes present in many
homologous elements and a number of essential genes (27,56,57).
We built profiles for 18 different proteins of MPF_G (Figure
4). We were unable to retrieve profiles for Tfc1_ICEHin10810,
Tfc4_ICEHin10810, Tfc20_ICEHin10810 and Tfc21_ICEHin10810; the
first two seem non-essential for conjugation (27). The
core complex of MPF_G might resemble the one of MPF_F
since Tfc13_MPFG, Tfc14_MPFG and Tfc15_MPFG (as well
as Tfc2_MPFG) are respectively homologs to TraK_MPFF,
TraB_MPFF and TraV_MPFF. Profile-profile alignments
revealed similarities between MPF_G and the inner
scaffold of MPF_T and MPF_F with Tfc11_MPFG, Tfc18_MPFG
and Tfc12_MPFG being homologous to VirB3/TraL_MPFF,
VirB6/TraG_MPFF and VirB8/TraE_MPFF proteins. MPF_G,
like MPF_I, exhibits two VirB2-like pilins, Tfc9_MPFG
and Tfc10_MPFG. Interestingly, MPF_G contains TraW_MPFF,
TraU_MPFF, TrbC_MPFF and TraF_MPFF homologs, although
they were thought to be specific to MPF_F (38). Deletion
of Tfc24, the homolog of TraW_MPFF, results in decreased
transcription of many of the T4SS genes (27). We could
also find the only detected homolog to TraH_MPFF among
all T4SS classes: Tfc22_MPFG. Overall, our set includes 18
protein profiles including nearly all T4SS essential proteins,
except Tfc20_ICEHin10810 and Tfc21_ICEHin10810. Profile-profile
alignments show striking homologies between MPF_G and
MPF_F in spite of their evolutionary distance and the use of
very different pili by the two systems.

MPF_B

Bacteroides thetaiotaomicron CTnDOT ICE encodes the
MPF_B model system in a 17-gene operon: traA-traQ_CTnDOT
(58). Only one of these proteins (TraG_CTnDOT) was
previously found to be homologous to components of the T4SS
of Proteobacteria (VirB4 protein) (59). The genes
traA-traD_CTnDOT seem more variable than the rest of the system
when compared with the closely related element CTnERL
(60); they might thus be non-essential or evolve so fast that
sequence similarity between distant elements is lost. The
proteins encoded by traG-N_CTnDOT are reportedly essential
for conjugation whereas traO-Q_CTnDOT might have a
regulatory role (58,60). We obtained profiles for all proteins
TraE-TraQ_CTnDOT. The vast majority (99%) of T4SS that we
detected with these profiles are encoded in genomes of the
Bacteroidetes phylum. Profile-profile comparisons show that
MPF_B has at least two VirB9 homologs, TraQ_MPFB and
TraN_MPFB. TraM_MPFB matches the VirB10 PFAM domain,
but not our VirB10 protein profile. As for the homology
between VirB10 and TraB_F, this is because the PFAM domain
is much shorter and includes more distant homologs. The
comparisons also revealed that the most-conserved
proteins of the inner scaffold (VirB3, VirB6 and VirB8) have
homologs in MPF_B (respectively TraF_MPFB, TraJ_MPFB and
TraK_MPFB). The TraE_CTnDOT and TraI_CTnDOT proteins are
thought to be pilins since we find them to be homologous
to VirB2 and VirB5 in our analysis. TraO_MPFB is
homologous to TraI_MPFB, suggesting that MPF_B might have
the peculiarity of encoding three different pilins. We were
thus able to define 12 protein profiles for MPF_B, including
most known essential genes of CTnDOT, of which eight
have homologs in other systems. This analysis suggests that
this class of T4SS strongly resembles some proteobacterial
T4SSs. Hopefully, this will facilitate further studies on these
systems.

Figure 3. Relationships of homology between protein families of different mating-pair formation (MPF) classes. Subscript letters correspond to the I, C, G, T, F, B, FATA and FA MPF classes (see text for details). Black lines represent direct relationships, i.e. an HHsearch P-value < 0.001. Dotted wide lines correspond to relationships that have been established by structure or sequence similarity, but not by profile alignment. Dotted thin lines represent less certain relationships given by profile alignments: the HHsearch score suggests a relation of homology, but the two proteins exhibit different features (e.g., domain organization, protein length or presence of specific motifs). White squares represent profiles matching many classes (e.g. VirB4). The color scheme used for the boxes corresponds to the MPF classes: blue for MPF_T, red for MPF_F, green for MPF_I, yellow for MPF_G, cyan for MPF_C, black for MPF_B, orange for MPF_FA and purple for MPF_FATA.

MPF_C

There is some circumstantial evidence of conjugation in
Cyanobacteria (61), and we have previously identified
VirB4 homologs in genomes of this taxon (11). Yet we
were unable to find experimental or computational
studies on cyanobacterial conjugative systems. We built eight
protein profiles that were highly MPF_C-specific (>95%
hitting genomes of Cyanobacteria). These hits are
systematically associated with VirB4 and form compact
genetic loci. We used the alpha plasmid of Nostoc sp. PCC
7120 as a model for MPF_C. Profile-profile comparisons
show that MPF_C has two proteins homologous to the
core complex in other systems (a VirB9/TraV_MPFF and
a VirB10/TraB_MPFF homolog). The MPF_C inner
scaffold might resemble those of MPF_T and MPF_F since
we found homologs to VirB3/TraL_MPFF, VirB6/TraG_MPFF
and VirB8/TraE_MPFF (Figure 3). Overall, among the eight
new MPF_C protein profiles, five show homologs with
components of MPF from Proteobacteria.

Figure 4. Representation of the different mating-pair formation (MPF) classes. The length of the arrows is proportional to the mean length of the corresponding genes. Bold arrows represent genes for which the corresponding protein profile was already available (6,8,11). White arrows represent genes for which we did not obtain a profile. Gray arrows represent genes for which we built a protein profile lacking homologs in other classes. A given color (except white and gray) corresponds to a single family of homologs.
Figure 6. Representation of the presence/absence of the most-conserved
protein families along the VirB4 phylogeny as presented in Figure 1 and
in (11). Green shapes represent inferred protein gains, whereas red shapes
represent protein losses. The colors of the arrows (as shown on the right)
correspond to those in Figure 4. The bicolor VirD4/TcpA arrow means
that some MPF_FA systems use VirD4 as coupling protein, whereas others
use TcpA.



MPFFA

The 12-gene model conjugative locus of Tn916 is the
best-described MPFFA system. This class is mostly found in
Firmicutes and Actinobacteria (62,63). Tn916, ICEBs1 of
Bacillus subtilis and the plasmid pCW3 of Clostridium
perfringens all use TcpA as the coupling protein instead
of VirD4 and encode a peculiar relaxase (Orf20MPFFA, a
MOBT) related to rolling-circle replication initiators from
plasmids and phages (29,64). The presence of a relaxase
within the MPF region is unique among the systems we
considered in this work. These systems do not encode cell
surface adhesins (5). We built seven specific profiles for
this system: the TcpA distant homolog of VirD4, and six
putative components of the T4SS (Figure 4). Orf22MPFFA
and Orf23MPFFA are the only pair of protein families from
a single system for which we could only build a single
profile that systematically matches both proteins. As
expected, given the absence of an outer membrane, our
analysis suggests that MPFFA lacks homologs to the
components of the core complex found in proteobacterial
systems. Orf15MPFFA is the only profile homologous to
components of the inner scaffold (VirB6/TraGMPFF) that we
could identify in this T4SS class. The structure of TcpCpCW3,
a close homolog of Orf13Tn916, has remarkable similarities
with that of VirB8 even though sequence similarity
is not significant (65). Hence, MPFFA might have two
components homologous to the inner scaffold of MPFT
(Orf15MPFFA and Orf13MPFFA), and these proteins have been
shown to interact (66). MPFFA also has a component
homologous to VirB1: Orf14MPFFA. Its homolog TcpGpCW3
has a hydrolase-like activity on C. perfringens peptidoglycan
(67), which is consistent with a VirB1-like role in MPFT. We
could not build a profile for Orf18Tn916 matching our
specificity criteria (less than 50% of the hits neighbored virB4).
This gene encodes an anti-restriction protein that is probably
not part of the typical MPFFA (68). The last gene of the
operon, orf24Tn916, is often found in other MPFFA systems
isolated from the rest of the operon (69), or even not
mentioned as part of the conjugative machinery (70). It is thus
probably not an essential component of MPFFA.
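The specificity criterion mentioned above (the share of a profile's hits that neighbor virB4) can be illustrated with a minimal sketch. The 50% cutoff comes from the text; the `Hit` record, the function names and the neighborhood size `max_distance` are assumptions made for illustration and do not reproduce the actual pipeline.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    replicon: str   # replicon identifier (hypothetical field)
    position: int   # rank of the gene along the replicon
    profile: str    # name of the matching protein profile

def virb4_association(hits, profile, max_distance=30):
    """Fraction of `profile` hits with a VirB4 hit within `max_distance`
    genes on the same replicon (max_distance is an assumed value)."""
    virb4_positions = {}
    for h in hits:
        if h.profile == "VirB4":
            virb4_positions.setdefault(h.replicon, []).append(h.position)
    targets = [h for h in hits if h.profile == profile]
    if not targets:
        return 0.0
    near = sum(
        1 for h in targets
        if any(abs(h.position - p) <= max_distance
               for p in virb4_positions.get(h.replicon, []))
    )
    return near / len(targets)

def is_specific(hits, profile, threshold=0.5, max_distance=30):
    """A profile passes when at least `threshold` of its hits neighbor VirB4."""
    return virb4_association(hits, profile, max_distance) >= threshold
```

Under this sketch, a profile whose hits mostly fall far from any VirB4 gene, as reported for Orf18Tn916, would be rejected.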

MPFFATA

This class includes systems from an extremely diverse group
of Prokaryotes (Firmicutes, Actinobacteria, Tenericutes
and Archaea). These monoderms have very diverse cell
envelopes (no cell wall in Tenericutes, different lipids in
Archaea, thick cell walls in Firmicutes and Actinobacteria),
and this may have accelerated their diversification. This
might explain why we could not create a single set of
protein profiles for all the systems in the class. The lack of
experimental studies in most MPFFATA systems further
complicated our task. Nevertheless, we identified proteins
associated with four sub-classes of T4SS for which conjugative
systems have been studied experimentally. These sub-classes
correspond to monophyletic sub-groups in the MPFFATA
VirB4 phylogeny. Some of the components have highly
similar homologs between sub-classes.

The 13-gene operon encoding the conjugative system of
the pGO1 plasmid from Staphylococcus aureus (71,72) was
used as a model for a sub-class only found among plasmids
of Firmicutes. We built six profiles specific to genes encoded
in the operon. The TrsIpGO1 and TrsMpGO1 profiles were not
specific, with respectively only 7 and 2% of the hits
associated with a VirB4. The three other sub-classes were modeled
from the prg/pcf system of plasmid pCF10 from Enterococcus
faecalis (73,74), from the ICE CTn2 from Clostridium
difficile (75) and from an ICE of the Streptococcus agalactiae
NEM316 genome (ICESaNEM316) of the ICESa2603
family (76,77). We did not use ICESa2603 itself as a model
because it lacks a T4CP and therefore does not strictly
fit our definition of an ICE. These three MPFFATA
sub-classes share a core of three proteins (in addition to the
VirB4 and VirD4 homologs): PrgFpCF10 is homologous to
CD414CTn2 and GBS1363ICESaNEM316; PrgIpCF10 is homologous
to CD417CTn2 and GBS1361ICESaNEM316; PrgHpCF10
is homologous to CD415CTn2 and GBS1362ICESaNEM316.
Thus, these nine proteins are recovered by only three
different profiles. ICESaNEM316-like systems are only
found within some Streptococcus, and never on plasmids.
This element also shares an additional profile with the
Prg/Pcf system, namely PrgLpCF10, which corresponds to
GBS1349ICESaNEM316.

Profile-profile comparisons showed no protein profile
in MPFFATA with homology to the T4SS core
complex, with one exception: the homology between
PrgCpCF10 and VirB9 at the N-terminus of the latter,
where VirB9 interacts with the inner membrane. The
VirB9 region interacting with the outer membrane has no
discernible homologs in MPFFATA. On the other hand,
we found some homologs to components of the
inner-membrane pore complex: VirB3 homologs (TrsCpGO1,
PrgIpCF10/CD417CTn2/GBS1361ICESaNEM316) and VirB6
homologs (PrgHpCF10/CD415CTn2/GBS1362ICESaNEM316).
Besides, TraMpIP501, a homolog of TrsMpGO1 for which we
could not build a profile, is structurally similar to VirB8
(78). All the MPFFATA systems that we describe except
ICESaNEM316 encode a VirB1 homolog (TrsGpGO1,
PrgKpCF10, CD419CTn2). The peptidoglycan-degrading
activity of TraGpIP501 and PrgKpCF10, homologs of TrsGpGO1,
has been shown, confirming the relationship with VirB1
(79,80). We found no homologs to T4SS pilins in MPFFATA.
This is not unexpected, since these systems are thought to
encode adhesins to stabilize the mating process (see the
Introduction section).

DISCUSSION 

Homology between MPF classes 

Although profiles of one MPF class typically do not match 
proteins from other classes, profile-profile alignments re- 
vealed a number of homologs between classes. Importantly, 
our analyses revealed networks of homology between pro- 
files of different MPF classes (Figure 3). Some of these ho- 
mologies had previously been noticed in comparisons be- 
tween pairs of T4SS (4,5,38,60). Our analysis generalizes 
these results in a common methodological setup. 
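The networks of homology described above can be thought of as connected components of a graph whose nodes are profiles and whose edges are significant profile-profile matches. The following is a minimal sketch under that assumption; the 0.001 cutoff is taken from the Figure 5 legend, whereas the function name, the input format and the profile labels in the usage example are hypothetical.

```python
def homology_families(pairs, threshold=1e-3):
    """Group profiles into families linked by significant pairwise matches.

    `pairs` is an iterable of (profile_a, profile_b, score) tuples, where a
    lower score means a more significant match (assumed convention).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in pairs:
        root_a, root_b = find(a), find(b)
        if score < threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), set()).add(name)
    # Largest families first, then alphabetical order.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))

# Hypothetical labels and scores, for illustration only.
example = [("VirB9_T", "TraV_F", 1e-5),
           ("TraV_F", "TraK_F", 1e-4),
           ("VirB2_T", "TraX", 0.5)]
families = homology_families(example)
```

Single linkage of this kind groups two profiles as soon as a chain of significant matches connects them, which is one way of recovering distant relationships that no single pairwise comparison detects.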

Some components of the core complex of T4SS (VirB7,
VirB9 and VirB10) are conserved among diderms. This is
most notably the case for VirB9 (Figure 3), which even has
two homologs in MPFF and MPFB and three in MPFG.
VirB10 has homologs in all MPF classes of diderms,
including MPFI. VirB7 is a small fast-evolving lipoprotein, which
may explain why we could not find its homologs in other
classes. Several profiles from other systems are annotated as
lipoproteins and could thus be VirB7 analogs
(Supplementary Table S2). Monoderms lack an outer membrane and
thus have few homologs to the core complex of diderms.
Remarkably, PrgCMPFFATA shares some homology with VirB9.
This protein has no specific attributed function (81) but
displays cell wall anchor motifs and features of other
monoderm surface proteins (repeat regions enriched with proline
and negatively charged residues) (5). Part of the VirB10
protein forms the outer membrane pore (40) and this may
explain its absence from monoderms.

The three proteins of the inner-membrane pore (VirB3,
VirB6 and VirB8) are found in almost all MPF classes.
VirB3 has homologs in every class except MPFI and
MPFFA. The VirB3 partner ATPase (VirB4) has a distant
homolog in MPFI, so the presence of an unrecognized
VirB3 homolog within MPFI cannot be excluded. We
checked for the possibility that TraU could carry regions
homologous to VirB3 and VirB4, since VirB3 and VirB4
fusions are known to occur in some members of MPFT (82).
TraU homologs are larger than average VirB4s. However,
we could not find VirB3 signatures in the TraU homologs.
VirB3 proteins typically exhibit two TMDs (83). Detection
of TMDs confirmed this for VirB3 and its homologs
with the exception of TraLMPFF (Supplementary Table S2).
VirB6 has recognizable homologs in every MPF class
except MPFI. This key component of the T4SS has a mass
between 30 and 35 kDa and a high number of TMDs (21). In MPFI,
we find a protein family (TraYMPFI) with >30 kDa and
typically more than four TMDs (nine domains in TraYR64).
This is a good candidate for an analog of VirB6. This
hypothesis is reinforced by the fact that TraYR64 is the
partner of ExcA in R64 entry exclusion (84), an interaction also
observed between VirB6 and Eex (85). All the other
homologs of VirB6 present at least three TMDs
(Supplementary Table S2). VirB8 is a bitopic protein that shows
recognizable homologs in all systems of diderms. Interactions
of VirB8 have been reported with nearly all other
components of the T4SS (86). The differences between monoderms
and diderms in terms of the external structure of the T4SS
may explain why Orf13MPFFA (including TcpCpCW3), a
protein highly similar in structure, has no significant sequence
similarity with VirB8 (65). PrgLMPFFATA might also be an
analog of VirB8 (20). All the homologs of VirB8, including
Orf13MPFFA and the putative analog PrgLMPFFATA, seem
to be bitopic since they exhibit one single TMD
(Supplementary Table S2). The major pilin (VirB2) has homologs
in several MPF classes of diderms, with the exception of
MPFF and MPFC. Unsurprisingly, monoderms, which are
not known to have pili, lack homologs of VirB2. Finally,
VirB1, a non-essential cell wall hydrolase found in MPFT,
has homologs only in the two classes of monoderms. These
homologous hydrolases are larger than VirB1 and they have
an N-terminus anchored in the inner membrane, possibly as
an adaptation to the absence of an outer membrane and the
thicker cell wall of many monoderms (20,87).
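The reasoning used above to propose TraY as a VirB6 analog (a mass above 30 kDa and more than four TMDs) amounts to a simple filter on predicted protein features. The sketch below illustrates it; the `Family` record, its field names and the example values are hypothetical and are not taken from Supplementary Table S2.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Family:
    name: str          # protein family label (hypothetical)
    mass_kda: float    # mean predicted molecular mass, in kDa
    tmd_count: int     # typical number of predicted transmembrane domains

def virb6_like(families, min_mass_kda=30.0, min_tmds=5):
    """Names of families above 30 kDa with more than four TMDs."""
    return [f.name for f in families
            if f.mass_kda > min_mass_kda and f.tmd_count >= min_tmds]

# Invented values, for illustration only.
candidates = virb6_like([
    Family("famA", 34.0, 9),   # large, polytopic: retained
    Family("famB", 12.5, 2),   # small, two TMDs: rejected
    Family("famC", 45.0, 1),   # large but bitopic: rejected
])
```

Such a filter can only suggest analogs; as in the text, supporting evidence such as a shared interaction partner is needed to strengthen the hypothesis.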

Two of the ATPases of the VirB system (VirB4 and
VirD4) are widespread. VirB4 is thought to be ubiquitous
in T4SS and our analysis relied on this hypothesis.
Importantly, very few large (≥4) clusters of profiles lack VirB4
(9%). Absence of VirB4 might be due to pseudogenization
of the conjugation system or sequencing errors.
Accordingly, some of these large clusters matching a large number
of profiles for a given class include a pseudogene of VirB4.
VirD4 is also nearly ubiquitous in T4SS loci. This protein is
only replaced by a different AAA+ ATPase (TcpAFA) in a
sub-class of MPFFA (10,11,88). On the other hand, VirB11,
once thought to be the most frequent ATPase of T4SS (89),
is rarely found outside MPFT (19%). Its homolog in MPFI
(TraJMPFI) was shown to be distantly related and closer to
PilT, the ATPase involved in the retraction of type IV pili
(90).

Among the 104 protein profiles used in this study, 34 have
no homologs in other classes. These 'orphan' profiles are not
equally distributed (gray arrows in Figure 4). Only one is found in
MPFF, whereas they represent most of the profiles of MPFI.
This matches the high complexity of MPFI and the early
divergence of MPFI and MPFC from the remaining classes
(11). Some of these 'orphan' profiles might be distant
homologs of proteins present in other systems that passed
unnoticed in our sequence-based analysis. For example, we
found no homologs among pilins of different classes, which
evolve very fast in sequence. Structural data, when they
become available for the different classes, will help highlight
these cases and complete the networks of homology of
Figure 3.

Evolution of the T4SS gene repertoires 

The analysis of the total number of relationships of
homology between profiles shows that MPFT is the class
with the largest number of homologs in other classes
(23 links) (Figure 5). MPFF and MPFG are highly
connected among themselves, suggesting some homology
between these classes at least at the level of the core complex
and inner-membrane pore, in spite of not being neighbors
in the VirB4 phylogeny (11) (Figure 1). MPFC has a number
of components homologous with other MPF classes
organized in loci resembling the VirB operon, in spite of
the large evolutionary distance between MPFT and MPFC
in the VirB4 phylogeny (11). MPFB has more homologs
with MPFF (two VirB9) and MPFT (homology to the pilins
VirB2 and VirB5). Interestingly, MPFI is the least
connected class. This is consistent with its position at the base
of the VirB4 phylogeny and its use of a distant homolog of
VirB4, TraU (11). MPFI also has the peculiarity of encoding
two types of pili, one of which is necessary for liquid
mating and is homologous to the type 4 pili (55,91). The
systems of monoderms, and particularly MPFFA, also have few
homologs, reinforcing the claims that they have fewer
components than the other T4SS (20,92), possibly as a result of
adaptation to monodermy.

Figure 5. Pairwise relations of homology between the protein profiles of
the different mating-pair formation (MPF) classes named I, C, G, T, F,
B, FATA and FA as in (11). The width of the links is proportional to the
number of couples of profiles, between two MPF classes or within a single
MPF class, that have an HHsearch score below 0.001.
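The class-level connectivity discussed above (number of homology links between pairs of MPF classes) can be tallied as in the following sketch. It assumes a list of scored profile pairs and a mapping from each profile to its class; the 0.001 cutoff follows the Figure 5 legend, while the function names and input format are assumptions made for illustration.

```python
from collections import Counter

def class_links(pairs, profile_class, threshold=1e-3):
    """Count significant profile pairs for each unordered pair of classes."""
    links = Counter()
    for a, b, score in pairs:
        if score < threshold:
            key = tuple(sorted((profile_class[a], profile_class[b])))
            links[key] += 1
    return links

def links_to_other_classes(links):
    """Total number of links that each class shares with other classes."""
    totals = Counter()
    for (c1, c2), n in links.items():
        if c1 != c2:          # within-class links are ignored here
            totals[c1] += n
            totals[c2] += n
    return totals
```

Ranking the classes by the totals returned by `links_to_other_classes` gives the kind of comparison made in the text between the most and least connected classes.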

We mapped the patterns of the presence/absence of VirB 
homologs in the phylogenetic trees and drew the most par- 
simonious scenario for the recruitment of the different com- 
ponents to the T4SS (Figure 6). The scenarios for fast- 
evolving proteins must be taken with care because lost se- 
quence similarity may lead to an under-estimation of the 
proteins already present in the last common ancestor of all 
T4SS. Nevertheless, our analysis suggests that a large num- 
ber of components of the VirB systems were present early 
in the history of T4SS. 
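A parsimony mapping of presence/absence data on a tree can be illustrated with the Fitch algorithm, which finds the minimum number of state changes for a binary character on a rooted binary tree. This is a sketch of one standard approach, not necessarily the procedure used for Figure 6: it weights gains and losses equally, and the toy topology in the usage example is invented rather than taken from the VirB4 phylogeny.

```python
def fitch(tree, present):
    """Return (possible_root_states, minimum_changes).

    `tree` is a leaf name (str) or a 2-tuple of subtrees; `present` is the
    set of leaves in which the protein family is found.
    """
    if isinstance(tree, str):
        return {tree in present}, 0
    left, right = tree
    left_states, left_changes = fitch(left, present)
    right_states, right_changes = fitch(right, present)
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state: no extra change.
        return shared, left_changes + right_changes
    # The children disagree: one gain or loss is needed on this node.
    return left_states | right_states, left_changes + right_changes + 1

# Toy topology with hypothetical leaf labels, for illustration only.
toy_tree = ((("A", "B"), ("C", "D")), ("E", "F"))
root_states, changes = fitch(toy_tree, present={"A", "B", "C", "D"})
```

When the root state set contains both True and False, a single early gain and a single loss in the sister lineage are equally parsimonious; as the text notes, such ambiguity is greater for fast-evolving proteins whose homologs may go undetected.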

Webservers 

We had previously made available a web site (now 
called CONJscan, http://mobyle.pasteur.fr/cgi- 
bin/portal.py#forms::CONJscan-T4SSscan or the short 
URL http://bit.ly/CONJscan) that allows searching for 
T4SS and conjugation-related protein profiles. In the 
present work, we increased the number of profiles available
for search from 45 to 116. Our previously identified T4SS
and relaxases were distributed as a simple text file. This 
made its analysis difficult with no intuitive or user-friendly 
way to search or filter the results according to different 
criteria. We have now created a web site, CONJdb, which 
allows searching and browsing our extended dataset thanks 
to a graphical interface. Our web site also implements 
the classification scheme. Finally, the graphical interface
allows searches by bacterial species, by T4SS
class, or according to the presence or absence of T4SS 
components, namely the T4SS, the coupling protein and 
the relaxase. This should facilitate the discrimination of 
T4SS dedicated to conjugation from T4SS dedicated to 
protein secretion. We are working on the development of a 
stand-alone application that will be distributed and should 
be of use to analyze large metagenomic datasets with local 
computational resources. In the future, CONJdb will be 
regularly updated and linked with other genomic databases. 
The web site is accessible at http://conjdb.web.pasteur.fr. 

The present release of CONJdb has the results of the 
analysis of 2393 chromosomes and 1813 plasmids for a total 
of 2262 complete prokaryotic genomes. Plasmids sequenced 
without the corresponding chromosome were not included. 
We detected 947 conjugative systems (up from 515), 1181 
mobilizable elements (up from 595) and 646 T4SS lacking 
nearby relaxases (up from 243), for a total of 1623 T4SS. 
The most recent and comprehensive T4SS database to date, 
SecReT4 (93), contains 811 T4SS present on 638 different 
replicons. SecReT4 does not contain Cyanobacteria or Ar- 
chaea T4SS. Another tool, AtlasT4SS (94), contains infor- 
mation from 70 genomes (58 Bacteria, 1 Archaea and 11 
plasmids) and contains 134 clusters of orthologs. Impor- 
tantly, CONJdb includes information on the T4SS class 
and on the putative T4SS function. Finally, because pro- 
teins are annotated using HMM profiles, instead of blast- 
like searches in a large databank of sequences, CONJscan 
and CONJdb provide standard sequence annotations. We 
hope this study and these web resources will be useful for 
scientists studying T4SSs and particularly for those engag- 
ing in the study of poorly known ones. 
Supplementary Data are available at NAR Online.